Scientific Python antipatterns advent calendar day fourteen

For today, a bad habit that’s often a holdover from when we’re learning Python. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.

If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:

Sign up for the mailing list

and I’ll send a single email at the end with links to them all.

Closing files and context managers

One of the first tools that I teach complete beginners is how to open/read/write files. Many real world problems obviously rely on file IO, but also many other tools and examples only make sense if we can read and write files.

Getting started with file reading is easy:

fruit_file = open('fruit.txt')
fruit = fruit_file.read()
fruit
'banana'

File output is a bit trickier; as many readers will know, we have to close the file after writing:

colour_file = open('colour.txt', 'w')
colour_file.write('red\n')
colour_file.close()

but it’s easy to forget, especially in the context of a larger program:

colour_file = open('colour.txt', 'w')
# some code
colour_file.write('red\n')
# some more code
# forget to close the file
4

Now we have a piece of code that has different behaviour in different environments.

If we run the above code as a standalone script, i.e. in its own interpreter, then the missing close will not cause a problem. Python closes any open files automatically when the process exits.

However, if we run the exact same code in an environment where the Python interpreter keeps running - like an interactive console, or a Jupyter notebook - the file won’t get closed, so the text stays in Python’s write buffer and does not reach the file.

But if we then run the same code chunk a second time, the open line rebinds colour_file to a new file object. The previous file object is discarded and closed as a side effect, and so the output from the first run finally gets written.

What this means in practice is that when running code like the one above in a notebook environment, we are always seeing the output from the previous time that the code was executed. As you can guess, this makes it very hard to debug the output if it’s not exactly what we want!
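If you want to see the buffering for yourself, here is a minimal sketch. It assumes CPython with the default buffering settings, and it writes to a temporary directory (my addition, so that no real colour.txt gets overwritten). The with blocks used for the checks are explained further down.

```python
import os
import tempfile

# Work in a temporary directory so no real file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'colour.txt')

    colour_file = open(path, 'w')
    colour_file.write('red\n')

    # The file already exists, but the text is still sitting in the buffer.
    with open(path) as check:
        before_close = check.read()

    colour_file.close()

    # Closing flushes the buffer, so now the text is in the file.
    with open(path) as check:
        after_close = check.read()

print(repr(before_close))
print(repr(after_close))
```

With a write this small I would expect the first read to come back empty and the second to contain the text, though the exact moment a buffer gets flushed is an implementation detail and shouldn't be relied on.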

Even if we have the close, the output can be broken by something unrelated happening in the code. If we are unlucky enough to have a line that causes an error before the close, it will not be reached, and the output will not be written:

colour_file = open('colour.txt', 'w')
colour_file.write('yellow\n')

# this line raises a ValueError
int('five')

colour_file.close()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 5
      2 colour_file.write('yellow\n')
      4 # this line raises a ValueError
----> 5 int('five')
      7 colour_file.close()

ValueError: invalid literal for int() with base 10: 'five'

In the above bit of code we change our output from red to yellow, but the code crashes before we reach the close and so the output file does not change (until we run the code a second time!)

We can solve the crashing problem by using the finally block:

colour_file = open('colour.txt', 'w')
try:
    colour_file.write('green\n')  
    # this line raises a ValueError
    int('five')
finally:
    colour_file.close()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 5
      3     colour_file.write('green\n')  
      4     # this line raises a ValueError
----> 5     int('five')
      6 finally:
      7     colour_file.close()

ValueError: invalid literal for int() with base 10: 'five'

This tells Python to always close the file, even if our code crashes, but adds a lot of clutter, and still relies on us remembering to add the close manually. So the Pythonic way to do it is with a context manager, which implements this logic internally:

with open('colour.txt', 'w') as colour_file:
    colour_file.write('blue\n')  
    # this line raises a ValueError
    int('five')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 4
      2     colour_file.write('blue\n')  
      3     # this line raises a ValueError
----> 4     int('five')

ValueError: invalid literal for int() with base 10: 'five'

The with pattern makes the behaviour of our code much easier to reason about, by making sure that the file is closed (and hence the output written) regardless of whether we are running in an interactive environment or not, and even if there is a crash somewhere else in the code.
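To make "implements this logic internally" a bit more concrete, here is a rough sketch of the same try/finally wrapped up as a context manager with contextlib.contextmanager. The name opened is hypothetical and only for illustration; the real file object returned by open has its own enter and exit methods that do the equivalent job. The temporary directory and the except clause are also my additions, so that the sketch runs to the end and we can inspect the result.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def opened(path, mode='r'):
    """Hypothetical stand-in for open(), written out to show the try/finally."""
    file = open(path, mode)
    try:
        yield file      # the body of the with block runs here
    finally:
        file.close()    # runs whether or not the body raised an error


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'colour.txt')

    try:
        with opened(path, 'w') as colour_file:
            colour_file.write('blue\n')
            int('five')  # this line raises a ValueError
    except ValueError:
        pass  # swallowed only so that the checks below can run

    was_closed = colour_file.closed
    with open(path) as check:
        contents = check.read()

print(was_closed)
print(repr(contents))
```

Swapping opened(path, 'w') for the built-in open(path, 'w') should give the same result: the file is closed and the text is written, even though the body of the block crashed.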

A reasonable question: given these tricky debugging problems that we have just discussed, why not just teach beginners to use with right from the start? Two reasons:

  • I find that for novice programmers, it’s actually quite useful to think about having to explicitly close files, as it helps to reinforce the fact that writing to a file conceptually and actually having the bytes written to the disk are two separate processes. This gives a nice insight into the relationship between our program and the operating system that mediates access to the hardware on which it’s running.

  • The with block has some complexity in its own right! When introducing complete beginners to code, I usually want to get some sort of file output before talking about indented blocks, colons, etc. and without having to teach a second way of assigning variables (as) right at the start.
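For anyone who wants to poke at that first point, the gap between "writing" and "written" can be made visible with the file object's flush method. This is a small sketch under the same assumptions as before (CPython, default buffering, a temporary directory of my own choosing), not something I'd put in front of complete beginners.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'colour.txt')

    with open(path, 'w') as colour_file:
        colour_file.write('purple\n')

        # Written as far as our program is concerned, but still buffered.
        with open(path) as check:
            before_flush = check.read()

        # flush() hands the buffered text over to the operating system.
        colour_file.flush()

        with open(path) as check:
            after_flush = check.read()

print(repr(before_flush))
print(repr(after_flush))
```

Note that flush only passes the data to the operating system, which may still hold it in its own cache before it physically reaches the disk.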

One more time; if you want to see the rest of these little write-ups, sign up for the mailing list:

Sign up for the mailing list